Abstract


Corrupting the input and hidden layers of deep neural networks (DNNs) with
multiplicative noise, often drawn from the Bernoulli distribution (or ‘dropout’),
provides regularization that has significantly contributed to deep learning’s success.
However, understanding how multiplicative corruptions prevent overfitting
has been difficult due to the complexity of a DNN’s functional form. In this paper,
we show that when a Gaussian prior is placed on a DNN’s weights, applying
multiplicative noise induces a Gaussian scale mixture, which can be reparameterized
to circumvent the problematic likelihood function. Analysis can then proceed by
using a type-II maximum likelihood procedure to derive a closed-form expression
revealing how regularization evolves as a function of the network’s weights.
Results show that multiplicative noise forces weights to become either sparse or
invariant to rescaling. We find our analysis has implications for model compression
as it naturally reveals a weight pruning rule that starkly contrasts with the
commonly used signal-to-noise ratio (SNR). While the SNR prunes weights with large
variances, seeing them as noisy, our approach recognizes their robustness and
retains them. We empirically demonstrate our approach has a strong advantage over
the SNR heuristic and is competitive with retraining with soft targets produced from
a teacher model.


1 Introduction 

Training deep neural networks (DNNs) under multiplicative noise, by introducing a random variable
into the inner product between a hidden layer and a weight matrix, has led to significant
improvements in predictive accuracy. Typically the noise is drawn from a Bernoulli distribution,
which is equivalent to randomly dropping neurons from the network during training, and hence the
practice has been termed dropout. Recent work suggests equivalent, if not better,
performance using Beta or Gaussian distributions for the multiplicative noise. Thus, in this paper
we consider multiplicative noise regularization broadly, not limiting our focus just to the Bernoulli
distribution.

Despite its empirical success, regularization by way of multiplicative noise is not well understood 
theoretically, especially for DNNs. The multiplicative noise term eludes analysis as a result of 
being buried within the DNN’s composition of non-linear functions. In this paper, by adopting a 
Bayesian perspective, we show that we can develop closed-form analytical expressions that describe
the effect of training with multiplicative noise in DNNs and other models. When a zero-mean
Gaussian prior is placed on the weights of the DNN, the multiplicative noise variable induces a 
Gaussian scale mixture (GSM), i.e. the variance of the Gaussian prior becomes a random variable 
whose distribution is determined by the multiplicative noise model. Conveniently, GSMs can be 
represented hierarchically with the scale mixing variable—in this case the multiplicative noise— 
becoming a hyperprior. This allows us to circumvent the problematic coupling of the noise and 
likelihood through reparameterization, making them conditionally independent. Once in this form a 
type-II maximum likelihood procedure yields closed-form updates for the multiplicative noise term 
and hence makes the regularization mechanism explicit. 

While the GSM reparameterization and learning procedure are not novel in their own right,
employing them to understand multiplicative noise in neural networks is new. Moreover, the analysis is
not restricted by the network’s depth or activation functions, as previous attempts at understanding
dropout have been. We show that regularization via multiplicative noise has a dual nature, forcing
weights to become either sparse or invariant to rescaling. This result is consistent with, but also
expands upon, previously derived adaptive regularization penalties for linear and logistic regression.

As for its practical implications, our analysis suggests a new criterion for principled model
compression. The closed-form regularization penalty isolated herein naturally suggests a new weight pruning
strategy. Interestingly, our new rule is in stark disagreement with the commonly used signal-to-noise
ratio (SNR). The SNR is quick to prune weights with large variances, deeming them noisy,
but our approach finds large variances to be an essential characteristic of robust, well-fit weights.
Experimental results on well-known predictive modeling tasks show that our weight pruning
mechanism is not only superior to the SNR criterion by a wide margin, but also competitive with retraining
with soft-targets produced by the full network. In each experiment our method was able to
prune at least 20% more of the model’s parameters than SNR before seeing a vertical asymptote in
test error. Furthermore, in two of these experiments, models pruned with our
method matched or improved on the error rate of the retrained networks until a 50% reduction in
parameters was reached.

2 Dropout Training and Previous Work 

Below we establish notation for training under multiplicative noise (MN) and review some relevant
previous work on dropout. In general, matrices are denoted by bold, upper-case variables, vectors
by bold, lower-case, and scalars by both upper and lower-case. Consider a neural network with $L$
total layers ($L - 2$ of them hidden). Forward propagation consists of recursively computing

$$\mathbf{h}_l = f_l(\mathbf{h}_{l-1}\mathbf{W}_l) \tag{1}$$

where $\mathbf{h}_l$ is the $d_l$-dimensional vector of hidden units located at layer $l$,
$\mathbf{h}_{l-1}$ is the $d_{l-1}$-dimensional vector of hidden units located at the previous layer
$l - 1$, $f_l$ is some (usually non-linear) element-wise activation function associated with layer
$l$, and $\mathbf{W}_l$ is the $d_{l-1} \times d_l$-dimensional weight matrix. If $l - 1 = 1$, then
$\mathbf{h}_{l-1} = \mathbf{x}_i$, a vector of input features corresponding to the $i$th training
example out of $N$, and if $l = L$, then $\mathbf{h}_L = \hat{\mathbf{y}}_i$, the class prediction
for the $i$th example. For notational simplicity, we’ll assume the bias term is absorbed into the
weight matrix and a constant is appended to $\mathbf{h}_{l-1}$. Training a neural network consists of
minimizing the negative log likelihood:
$\mathcal{L} = -\log \hat{\mathbf{y}}_i = -\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{W})$, where
$p(\mathbf{y} \mid \mathbf{X}, \mathbf{W})$ is a conditional distribution parameterized by the neural
network. $\mathbf{W}$ is learned through the backpropagation algorithm.

2.1 Training with Multiplicative Noise 

Training with multiplicative noise (MN) is a regularization procedure implemented by slightly
modifying Equation (1). It causes the intermediate representation $\mathbf{h}_{l-1}$ to become
stochastically corrupted by introducing random variables to the inner product
$\mathbf{h}_{l-1}\mathbf{W}_l$. Rewriting Equation (1) with MN, we have

$$\mathbf{h}_l = f_l(\mathbf{h}_{l-1}\boldsymbol{\Lambda}_l\mathbf{W}_l) \tag{2}$$

where $\boldsymbol{\Lambda}_l$ is a diagonal $d_{l-1} \times d_{l-1}$-dimensional matrix of random
variables $\lambda_{l,j}$ drawn independently from some noise distribution $p(\lambda)$. Dropout
corresponds to a Bernoulli distribution on $\lambda$.

Training proceeds by sampling a new $\boldsymbol{\Lambda}_l$ matrix for every forward propagation
through the network. Backpropagation is done as usual using the corrupted values. We can view the
sampling as Monte Carlo integration over the noise distribution, and therefore, the MN loss function
can be written as

$$\mathcal{L}_{\text{MN}} = \mathbb{E}_{p(\lambda)}\left[-\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{W}, \boldsymbol{\Lambda})\right] \tag{3}$$

where the expectation is taken with respect to the noise distribution $p(\lambda)$. At test time, the
bias introduced by the noise is corrected; for instance, the weights would be multiplied by $(1 - p)$
if we trained with Bernoulli($p$) noise.
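As a rough illustration of the procedure above, the following NumPy sketch samples a fresh Bernoulli noise vector for each layer on every training-mode forward pass and rescales the weights at test time. The function name `mn_forward`, the ReLU hidden activation, and the convention that `drop_prob` is the probability of zeroing a unit (so that the test-time scale is `1 - drop_prob`, matching the $(1 - p)$ correction mentioned above) are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def mn_forward(x, weights, drop_prob=0.5, train=False, rng=None):
    """Forward pass h_l = f_l(h_{l-1} Lambda_l W_l) with Bernoulli noise.

    x: (batch, d_in) inputs; weights: list of (d_{l-1}, d_l) matrices.
    Assumption: ReLU on hidden layers, identity on the output layer.
    """
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - drop_prob
    h = x
    for idx, W in enumerate(weights):
        if train:
            # One noise variable per unit of the incoming layer (the
            # diagonal of Lambda_l), resampled on every call.
            lam = rng.binomial(1, keep_prob, size=h.shape[-1])
            a = (h * lam) @ W
        else:
            # Test time: correct the bias introduced by the noise by
            # scaling the weights with the noise mean.
            a = h @ (keep_prob * W)
        is_output = idx == len(weights) - 1
        h = a if is_output else np.maximum(a, 0.0)
    return h
```

For a single linear layer, averaging many training-mode passes approaches the test-mode output, which is the Monte Carlo view of Equation (3) in miniature.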

2.2 Closed-Form Regularization Penalties 

Direct analysis of Equation (3) for neural networks with non-linear activation functions is currently
an open problem. Nevertheless, analysis of dropout has received a significant amount of attention
in the recent literature, and progress has been made by considering second order approximations,
asymptotic assumptions [23], linear networks, generative models of the data, and
convex proxy loss functions [8].

Since this paper is primarily concerned with interpreting MN regularization as a closed-form penalty,
we summarize below the results of [22], which had similar goals, in order to build on them later. A
closed-form regularization penalty can be derived exactly for linear regression and approximately
for logistic regression. For linear regression, training under MN is equivalent to training with the
following penalized likelihood [22, 23, 3]:

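$$\mathcal{L}_{\text{MN LR}} = \sum_{i=1}^{N} \frac{1}{2}(y_i - \mathbf{x}_i\mathbf{w})^2 + \frac{1}{2}\mathrm{Var}[\lambda]\sum_{j=1}^{d} w_j^2 \sum_{i=1}^{N} x_{i,j}^2. \tag{4}$$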

The second term can be viewed as data-driven $L_2$ regularization in that the weights are being
penalized not by just their squared value but also by the sum of the squared features in the corresponding
dimension. Similarly, an approximate closed-form objective can be found for logistic regression via
a 2nd-order Taylor expansion around the mean of the noise [22]:

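$$\begin{aligned}
\mathcal{L}_{\text{MN LogR}} \approx {} & -\sum_{i=1}^{N} \left[ y_i \log f(\mathbf{x}_i\mathbf{w}) + (1 - y_i)\log\left(1 - f(\mathbf{x}_i\mathbf{w})\right) \right] \\
& + \frac{1}{2}\mathrm{Var}[\lambda]\sum_{j=1}^{d} w_j^2 \sum_{i=1}^{N} f(\mathbf{x}_i\mathbf{w})\left(1 - f(\mathbf{x}_i\mathbf{w})\right) x_{i,j}^2.
\end{aligned} \tag{5}$$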

Again we find an $L_2$ penalty adjusted to the data and, in this case, the model’s current predictions.
However, Helmbold and Long [8] have suggested that this approximation can substantially
underestimate the error.
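The exactness of the linear-regression result rests on a simple identity: for independent noise with mean one, $\mathbb{E}[(y - \sum_j x_j \lambda_j w_j)^2] = (y - \mathbf{x}\mathbf{w})^2 + \mathrm{Var}[\lambda]\sum_j w_j^2 x_j^2$, which is the structure of Equation (4) up to a common constant factor. The sketch below checks this numerically for a single example. The function names and the choice of Gaussian noise with mean one are assumptions made for illustration only.

```python
import numpy as np


def mc_expected_sq_error(x, y, w, noise_var, n_samples, rng):
    """Monte Carlo estimate of E[(y - sum_j x_j * lam_j * w_j)^2]."""
    # Assumption: Gaussian multiplicative noise, mean 1, variance noise_var.
    lam = rng.normal(1.0, np.sqrt(noise_var), size=(n_samples, x.size))
    preds = (x * lam) @ w
    return np.mean((y - preds) ** 2)


def penalized_sq_error(x, y, w, noise_var):
    """Closed form: squared error plus the data-scaled L2 penalty."""
    return (y - x @ w) ** 2 + noise_var * np.sum(w ** 2 * x ** 2)
```

With enough samples the two quantities agree to within Monte Carlo error, which illustrates why the penalty weights each $w_j^2$ by the squared feature values in its dimension.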

3 Multiplicative Noise as an Induced Gaussian Scale Mixture 

In this section we go beyond prior work to show that analysis of multiplicative noise (MN)
regularization can be made tractable by adopting a Bayesian perspective. The key observation is that
if we assume the weights to be Gaussian random variables, the product $\lambda w$, where $\lambda$ is
the noise and $w$ is a weight, defines a Gaussian scale mixture (GSM). GSMs can be represented
hierarchically with the scale mixing variable (in this case the noise $\lambda$) becoming a
hyperprior. The reparameterization works even for deep neural networks (DNNs) regardless of their
size or activation functions.

3.1 Gaussian Scale Mixtures 

First we define a Gaussian scale mixture. A random variable $\theta$ is a Gaussian scale mixture (GSM)
if and only if it can be expressed as the product of a Gaussian random variable, call it $u$, with zero
mean and some variance $\sigma^2$ and an independent scalar random variable $z$ [10]:

$$\theta \overset{d}{=} zu \tag{6}$$

where $\overset{d}{=}$ denotes equality in distribution. While it may not be obvious from (6) that
$\theta$ is a scale mixture, the result follows from the Gaussian’s closure under linear
transformations, resulting in the following marginal density of $\theta$:

$$p(\theta) = \int \mathcal{N}(\theta;\, 0,\, \sigma^2 z^2)\, p(z)\, dz \tag{7}$$

where $p(z)$ is the mixing distribution. Super-Gaussian distributions, such as the Student-t
($z^2 \sim$ Inverse Gamma), can be represented as GSMs, and this hierarchical formulation is often used
when employing these distributions as robust priors.
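The equivalence between the product form and the hierarchical form of a GSM can be checked by simulation. The sketch below draws from both and lets one compare sample moments; the function names and the use of a Beta mixing distribution in the usage check are assumptions for illustration, and any distribution for $z$ could be substituted.

```python
import numpy as np


def sample_gsm_product(z, sigma, rng):
    """Product form: theta = z * u with u ~ N(0, sigma^2), independent of z."""
    u = rng.normal(0.0, sigma, size=z.shape)
    return z * u


def sample_gsm_hierarchical(z, sigma, rng):
    """Hierarchical form: theta | z ~ N(0, sigma^2 * z^2)."""
    # The scale of each draw is set by its own mixing variable z.
    return rng.normal(0.0, sigma * np.abs(z))
```

Both samplers should give the same marginal variance, $\sigma^2\,\mathbb{E}[z^2]$, even though only the first one multiplies a Gaussian draw by the mixing variable explicitly.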

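Now that we’ve defined GSMs, we demonstrate how MN can give rise to them. Consider the addition
of a Gaussian prior to the MN training objective given in Equation (3):

$$\mathcal{L}_{\text{GSM}} = \mathbb{E}_{p(\lambda)}\left[-\log\left(p(\mathbf{y} \mid \mathbf{X}, \mathbf{W}, \boldsymbol{\Lambda})\, p(\mathbf{W})\right)\right]$$

where, for a DNN, $p(\mathbf{W}) = \prod_{l=2}^{L}\prod_{j=1}^{d_{l-1}}\prod_{k=1}^{d_l} \mathcal{N}(0, \sigma_0^2)$,
i.e., an independent Gaussian prior on each weight coefficient with some constant variance
$\sigma_0^2$. Next recall the inter-layer computation defined in Equation (2):

$$\mathbf{h}_l = f_l(\mathbf{h}_{l-1}\boldsymbol{\Lambda}_l\mathbf{W}_l) = f_l(\mathbf{a}_l).$$

$\mathbf{a}_l$ is a $d_l$-dimensional vector whose $k$th element can be written in summation notation as

$$a_{l,k} = \sum_{j=1}^{d_{l-1}} h_{l-1,j}\, \lambda_{l,j}\, w_{l,j,k}.$$

Notice that $w_{l,j,k} \sim \mathcal{N}(0, \sigma_0^2)$ and $\lambda_{l,j} \sim p(\lambda)$, so the product
$\lambda_{l,j} w_{l,j,k}$ matches the definition of a GSM given in (6).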

The result follows just from application of the definition, but for a more intuitive explanation,
consider the case of a constant $c$ multiplied by a Gaussian random variable
$w \sim \mathcal{N}(0, \sigma_0^2)$ as above. The product $cw$ is distributed as
$\mathcal{N}(0, c^2\sigma_0^2)$ due to the Gaussian’s closure under scalar transformation.
The definition of a GSM (6) says that the same result holds even if $c$ is a random variable; the only
difference is that the variance $c^2\sigma_0^2$ is now random itself. Rigorous treatments can be found
in the literature on scale mixtures.

3.2 The Hierarchical Parameterization for DNNs 

Here we introduce a key insight: the product between the weights of a DNN and the noise can
be represented hierarchically, as given in Equation (7), making the intractable likelihood
conditionally independent of the noise. Again, the reparameterization follows from the definition, and
it can be seen graphically in Figure 1. But to elaborate, it’s equivalent (in distribution) to
replacing the product $\lambda_{l,j} w_{l,j,k} \sim \mathcal{N}(0, \sigma_0^2\lambda_{l,j}^2)$ with a new
conditionally Gaussian random variable $v_{l,j,k} \sim \mathcal{N}(0, \sigma_0^2\lambda_{l,j}^2)$, with
$\lambda_{l,j}$ drawn from the noise distribution. The random rescaling that $\lambda_{l,j}$
explicitly applied to $w_{l,j,k}$ is still present yet collapsed into the distribution from which
$v_{l,j,k}$ is drawn.¹ Because this interaction occurs entirely within the activation function, the
complexities it introduces do not come into play. The only dependence that needs to be accounted for
when reparameterizing is the shared variance of all weights occupying the same row of $\mathbf{W}_l$
(due to the noise being sampled for each hidden unit). This poses no serious complications and is
actually a desirable property, as we discuss later. From here forward, the product form of a GSM is
referred to as the unidentifiable parameterization (since only the product $\lambda w$ can be
identified in the likelihood) and the hierarchical form as the identifiable parameterization.
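The row-wise structure described above can be made concrete with a small NumPy sketch. For a fixed draw of the noise, applying it to the hidden units gives exactly the same pre-activations as folding it into the rows of the weight matrix, and the hierarchical form samples those rescaled weights directly. All function names here are hypothetical, introduced only for this example.

```python
import numpy as np


def preact_product_form(h, lam, W):
    """Noise applied to hidden units: a = (h * lam) W, i.e. h Lambda W."""
    return (h * lam) @ W


def fold_noise_into_weights(lam, W):
    """V = Lambda W: row j of W is rescaled by lam[j]."""
    return lam[:, None] * W


def sample_hierarchical_weights(lam, d_out, sigma0, rng):
    """Draw v_{j,k} ~ N(0, sigma0^2 * lam_j^2); row j shares the scale lam_j."""
    eps = rng.standard_normal((lam.size, d_out))
    return sigma0 * lam[:, None] * eps
```

A unit whose noise variable is zero yields an all-zero row of rescaled weights, which is the connection to the spike-and-slab prior discussed next.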

3.3 Dropout’s Corresponding Prior 

We now turn to the case of $\lambda \sim$ Bernoulli($p$), the most widely used noise distribution.
Moving the Bernoulli random variable to the Gaussian random variable’s scale reveals the classic prior
for Bayesian variable selection, the Spike and Slab:

$$p(v_{l,j,k} \mid \lambda_{l,j}) = \begin{cases} \delta_0 & \text{if } \lambda_{l,j} = 0 \\ \mathcal{N}(0, \sigma_0^2) & \text{if } \lambda_{l,j} = 1 \end{cases} \tag{8}$$

where $\delta_0$ is the delta function placed at zero. Interestingly, the unidentifiable
parameterization has been used previously for linear regression in the work of Kuo and Mallick [12].
They placed the Bernoulli indicators directly in the likelihood as follows,

$$y_i = \sum_{j=1}^{p} x_{i,j}\, \gamma_j\, \beta_j + \epsilon_i$$

where $\beta_j$ are the regression coefficients and $\gamma_j \sim$ Bernoulli($p = 0.5$), essentially
defining dropout for linear regression over a decade before it was proposed for neural networks.
However, Kuo and Mallick were interested in the marginal posterior inclusion probabilities
$p(\gamma_j = 1 \mid \mathbf{y})$ rather than predictive performance.

¹Just as it is equivalent, in the previous example using the constant $c$, to represent the
distribution of $cw$ with a random variable $w^* \sim \mathcal{N}(0, c^2\sigma_0^2)$.

[Figure 1 panels: (a) Unidentifiable (Multiplicative Noise); (b) Identifiable (Hierarchical)]

Figure 1: Equivalent GSM parameterizations for a deep neural network. It is distributionally
equivalent to replace the product $\lambda w \sim \mathcal{N}(0, \lambda^2)$ with a new random variable
$v \sim \mathcal{N}(0, \lambda^2)$. The noise’s ($\lambda$) influence is preserved, flowing through the
weight, $v$, instead of directly into the next layer.


4 Type-II ML for the Hierarchical Parameterization 

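Having established that $p(\mathbf{y} \mid \mathbf{X}, \mathbf{W}, \boldsymbol{\Lambda})\,p(\mathbf{W})$
can be written as $p(\mathbf{y} \mid \mathbf{X}, \mathbf{V})\,p(\mathbf{V} \mid \boldsymbol{\Lambda})$,
we next wish to isolate the characteristics of the weights encouraged by multiplicative noise (MN)
regularization. Our aim is to write $\boldsymbol{\Lambda}$ as a function of $\mathbf{V}$ so we can
explicitly see the interplay between the noise and parameters. To do this, we learn
$\boldsymbol{\Lambda}$ from the data via a type-II maximum likelihood procedure (a form of empirical
Bayes). Note that this is hard to do in the unidentifiable parameterization due to explaining away.
The identifiable (hierarchical) parameterization, on the other hand, allows for an
Expectation-Maximization² (EM) formulation, as described in prior work. The derivation of the EM
updates is as follows:

$$\begin{aligned}
\mathcal{L} &= -\log p(\boldsymbol{\Lambda} \mid \mathbf{y}, \mathbf{X}) \\
&\propto -\log\left[p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\Lambda})\, p(\boldsymbol{\Lambda})\right] \\
&\le -\int q(\mathbf{V}) \log \frac{p(\mathbf{y} \mid \mathbf{X}, \mathbf{V})\, p(\mathbf{V} \mid \boldsymbol{\Lambda})\, p(\boldsymbol{\Lambda})}{q(\mathbf{V})}\, d\mathbf{V}.
\end{aligned} \tag{9}$$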

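We make two simplifying assumptions to make working with the posterior manageable. First,
following prior work, we choose $q(\mathbf{V}) = p(\mathbf{V} \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\Lambda})$,
which corresponds to approximating the joint posterior with
$p(\mathbf{V}, \boldsymbol{\Lambda} \mid \mathbf{y}, \mathbf{X}) \approx p(\mathbf{V} \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\Lambda})\,\delta_{\text{MAP}}(\boldsymbol{\Lambda})$.
Second, we assume that $p(\mathbf{V} \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\Lambda})$ factorizes over
its dimensions.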

Hence, the E-Step is computing

$$Q_t = \mathbb{E}_{\mathbf{V} \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\Lambda}}\left[-\log p(\mathbf{V} \mid \boldsymbol{\Lambda})\, p(\boldsymbol{\Lambda})\right], \tag{10}$$

where the likelihood was dropped since it doesn’t depend on $\boldsymbol{\Lambda}$, and the M-Step is

$$\lambda_{l,j}^{t+1} = \underset{\lambda_{l,j}}{\arg\min}\; Q_t. \tag{11}$$

²Actually, we perform an equivalent minimization, instead of maximization, in the M-step to keep
notation consistent with earlier equations.


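In our case, $p(\mathbf{V} \mid \boldsymbol{\Lambda})$ is a fully-factorized Gaussian so the gradient is

$$\frac{\partial Q_t}{\partial \lambda_{l,j}} = -\frac{\sum_{k=1}^{d_l} \mathbb{E}_{v \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\Lambda}}\left[v_{l,j,k}^2\right]}{\lambda_{l,j}^3} + \frac{d_l}{\lambda_{l,j}} + \frac{\partial}{\partial \lambda_{l,j}}\left[-\log p(\lambda_{l,j})\right]. \tag{12}$$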


Unfortunately, the EM formulation cannot handle discrete noise distributions (and by extension,
discrete mixtures) since we can’t calculate $\partial Q_t / \partial \lambda_{l,j}$ if $\lambda_{l,j}$
is not a continuous random variable. While this does not allow us to address Bernoulli noise (i.e.
dropout) exactly, this is not a severe limitation for a few reasons. Firstly, as discussed later, the
noise distribution encourages particular values for $\lambda_{l,j}$ but does not fundamentally change
the nature of the regularization being applied to the DNN’s weights. Secondly, empirical observations
support that our conclusions apply to Bernoulli noise as well. Lastly, the Beta($\alpha, \beta$) with
$\alpha = \beta < 1$ can serve as a continuous proxy for the Bernoulli(0.5).
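To see why a symmetric Beta with small shape parameters resembles Bernoulli(0.5), note that its mean is 0.5 and its variance is $1 / (4(2\alpha + 1))$, which approaches the Bernoulli(0.5) variance of 0.25 as $\alpha \to 0$, while the mass piles up near 0 and 1. The sketch below summarizes samples from such a Beta; the function name and the 0.1 cutoff used to define "near an endpoint" are arbitrary choices made for illustration.

```python
import numpy as np


def beta_proxy_stats(alpha, n_samples, rng):
    """Summarize draws from Beta(alpha, alpha) as a continuous Bernoulli(0.5) proxy."""
    lam = rng.beta(alpha, alpha, size=n_samples)
    # Fraction of draws within 0.1 of either endpoint (cutoff is arbitrary).
    near_endpoints = np.mean((lam < 0.1) | (lam > 0.9))
    return lam.mean(), lam.var(), near_endpoints
```

Smaller values of `alpha` push the sample variance toward 0.25 and the endpoint fraction toward one, at the cost of a density that is increasingly sharp near 0 and 1.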


5 Analysis of the Regularization Mechanism 


Equation (12) provides an important window into the effect of multiplicative noise (MN) by
revealing the properties of the weights that influence the regularization. Below we analyze Equation (12)
in detail, showing that multiplicative noise results in weights becoming either sparse or invariant to
rescaling. We start by setting (12) to zero, making the substitution
$\mathbb{E}[v^2] = \mathbb{E}^2[v] + \mathrm{Var}[v]$, and rearranging to solve for the variance term:

$$\lambda_{l,j}^2 = \frac{1}{d_l}\sum_{k=1}^{d_l} \mathbb{E}^2_{v \mid \mathbf{y}, \lambda}\left[v_{l,j,k}\right] + \frac{1}{d_l}\sum_{k=1}^{d_l} \mathrm{Var}_{v \mid \mathbf{y}, \lambda}\left[v_{l,j,k}\right] + \frac{\lambda_{l,j}^3}{d_l}\frac{\partial}{\partial \lambda_{l,j}}\log p(\lambda_{l,j}). \tag{13}$$

The first term is the squared posterior mean, and the second is the posterior variance. Both are
averaged across weights emanating from the same unit due to the dependence discussed in Section 3.2.
The third term contains the derivative of the log noise density. Moreover, notice that the
$\frac{\partial}{\partial \lambda_{l,j}}\log p(\lambda_{l,j})$
term does not contain the DNN’s parameters and therefore only serves as a prior expressing which
values of $\lambda_{l,j}$ are preferred. The regularization pertinent to the network’s parameters is
contained in the first two terms only.
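Dropping the noise-prior term, the update in Equation (13) reduces to averaging the squared posterior mean plus the posterior variance over the weights that leave each unit. The sketch below computes that quantity from posterior samples of one layer. The function name and the assumed sample layout `(S, d_prev, d_l)` are hypothetical, and how the samples are obtained is left open.

```python
import numpy as np


def m_step_lambda_sq(samples):
    """Empirical-Bayes scale per unit, ignoring the noise-prior term.

    samples: array of shape (S, d_prev, d_l) holding S posterior draws of a
    layer's weights. Returns an array of shape (d_prev,) with lambda_j^2.
    """
    post_mean = samples.mean(axis=0)   # E[v_{j,k}]
    post_var = samples.var(axis=0)     # Var[v_{j,k}]
    # Average over the d_l weights emanating from the same unit j.
    return (post_mean ** 2 + post_var).mean(axis=1)
```

Because the squared mean plus the variance equals the second moment, the result is simply the average of $v^2$ over samples and outgoing weights, so a unit obtains a large scale either through large weights or through high posterior variance.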

In light of this observation, we discard the noise distribution term for the time being and work with 
just the first two empirical Bayesian terms. We can substitute them into the variance of the Gaussian 
prior on V to see what regularization penalty MN is applying, in effect, to the weights: 


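$$\mathcal{R}_{\text{GSM}}(\mathbf{V}) = -\log p(\mathbf{V} \mid \boldsymbol{\Lambda}) \propto \frac{1}{\sigma_0^2}\sum_{l=2}^{L}\sum_{j=1}^{d_{l-1}} \frac{\sum_{k=1}^{d_l} v_{l,j,k}^2}{\frac{1}{d_l}\sum_{k=1}^{d_l} \mathbb{E}^2_{v \mid \mathbf{y}, \lambda}\left[v_{l,j,k}\right] + \frac{1}{d_l}\sum_{k=1}^{d_l} \mathrm{Var}_{v \mid \mathbf{y}, \lambda}\left[v_{l,j,k}\right]} \tag{14}$$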


Given the Gaussian prior assumption, what results is a sparsity-inducing $L_2$ penalty whose strength
is inversely proportional to two factors: the squared mean and variance of the weight under the 
posterior. The posterior mean can be thought of as signal, the strength of the weight, and the variance 
can be thought of as robustness, the scale invariance of the weight. 

To further analyze the properties of (14), let us assume the current values of the weights are near
their posterior means: $v_{l,j,k} \approx \mathbb{E}[v_{l,j,k}]$. This assumption simplifies (14) to

$$\mathcal{R}_{\text{GSM}}(\mathbf{V}) \propto \frac{1}{\sigma_0^2}\sum_{l=2}^{L}\sum_{j=1}^{d_{l-1}} \frac{d_l}{1 + \dfrac{\sum_{k=1}^{d_l} \mathrm{Var}_{v \mid \mathbf{y}, \lambda}\left[v_{l,j,k}\right]}{\sum_{k=1}^{d_l} v_{l,j,k}^2}}. \tag{15}$$

The fractional term in the denominator,
$\sum_{k=1}^{d_l} \mathrm{Var}_{v \mid \mathbf{y}, \lambda}[v_{l,j,k}] / \sum_{k=1}^{d_l} v_{l,j,k}^2$,
represents two alternative paths the weights can take to reduce the penalty during training. The DNN
must either send $\sum_{k=1}^{d_l} v_{l,j,k}^2 \to 0$ or
$\sum_{k=1}^{d_l} \mathrm{Var}_{v \mid \mathbf{y}, \lambda}[v_{l,j,k}] \to \infty$. The former occurs
when weights become sparse, and the latter occurs when weights are robust to rescaling (i.e. they do
not have to be finely calibrated). Hence, we observe a dual effect not seen in traditional sparsity
penalties. MN allows weights to grow without restraint just so long as they are invariant to
rescaling. If not, they are shrunk to zero.
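The two escape routes can be seen numerically by evaluating the per-unit summand of Equation (15), $d_l / (1 + \sum_k \mathrm{Var}[v_k] / \sum_k v_k^2)$, for a few configurations. The sketch below is illustrative only: the function name is hypothetical, the $1/\sigma_0^2$ prefactor and the outer sums over layers and units are omitted, and the expression is rearranged algebraically so that all-zero weights do not cause a division by zero.

```python
import numpy as np


def unit_penalty(v, post_var):
    """Per-unit term d / (1 + sum(var) / sum(v^2)), rewritten to avoid 0-division.

    v: current outgoing weights of one unit; post_var: their posterior variances.
    Assumes sum(v^2) + sum(post_var) > 0.
    """
    sq = np.sum(v ** 2)
    return v.size * sq / (sq + np.sum(post_var))
```

With zero variance the penalty takes its maximum value, the number of outgoing weights; it falls toward zero both when the weights vanish and when their posterior variance grows large.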


Thinking back to how MN regularization is usually carried out in practice (namely, by Monte Carlo
sampling within the likelihood), we see that training in this way is essentially finding the
invariant weights by brute force. The only way the negative log likelihood can reliably be decreased
is by pruning weights that cannot withstand being tested at random scales. Dropout obscures this
fact to some degree by being a discrete mixture over just two scales, zero and one. The superior
performance of continuous distributions, observed both in [17, 20] and further supported in our
supplemental materials, may be due to searching over a richer, infinite scale space.


On a final note, the closed-form dropout penalties from Equations (4) and (5) can be recovered from
$\mathcal{R}_{\text{GSM}}(\mathbf{V})$ by (1) assuming the Gaussian prior necessary for our analysis to
be diffuse and therefore negligible, and (2) assuming the posterior mean is the same as the prior
mean, which is necessary because Wager et al. [22] perform the Taylor expansion around the mean. This
removes the $\mathbb{E}^2[v]$ term from the denominator of $\mathcal{R}_{\text{GSM}}(\mathbf{V})$ (14).
Interestingly, this modification results in (14) becoming

$$\mathcal{R}_{\text{GSM Reg}}(\mathbf{V}) \propto \sum_{j=1}^{d} \frac{v_j^2}{\sigma_0^2\, \mathrm{Var}[v_j]}, \tag{16}$$

which is the inverse of the term we isolated in Equation (15) as capturing the nature of MN
regularization. The resulting behavior is the same, since in (15) that term appeared in the
denominator. See the supplementary material for the details of the derivation. Wager et al.
interpreted their findings as an $L_2$ penalty scaled by the inverse diagonal Fisher Information. Yet,
via the Cramer-Rao lower bound, their result could also be seen as an $L_2$ penalty scaled by the
asymptotic variance of the weights. A notion of variance, then, is just as integral to their
frequentist derivation as it is to our Bayesian one.


6 Experiments: Weight Pruning 

We conducted a number of experiments to empirically investigate if our results present new
directions for algorithmic improvements in training DNNs. We implemented the EM algorithm derived
in Section 4 using Langevin Dynamics [25], an efficient stochastic gradient technique for collecting
posterior samples, to calculate the posterior moments needed for the M-Step. We found that we
could not outperform Monte Carlo MN regularization for any of the deep architectures with which
we experimented (see supplemental materials). We conjecture that the practical issue of computing
the posterior moments was likely the bottleneck, which is to be expected given that developing
efficient Bayesian learning algorithms for DNNs is a challenging and open problem in and of itself.

However, we did find immediate and practical benefits in the context of model compression.
Our conclusions about how MN regularizes DNNs conspicuously differ from the signal-to-noise
ratio (SNR) for weight pruning tasks, as used in prior work. With this in mind
we carried out a series of weight pruning experiments for the dual purpose of validating our analysis
and providing a novel weight pruning rule (that turns out to be superior to the SNR).

The SNR heuristic is defined by the following inequality: $|\mu_{l,j,k}| / \sigma_{l,j,k} < \tau$, where
$|\mu_{l,j,k}|$ is the absolute value of the posterior mean of weight $v_{l,j,k}$, $\sigma_{l,j,k}$ is
the posterior standard deviation of the same weight, and $\tau$ is some positive constant. Pruning is
carried out by setting to zero all weights for which the inequality holds (i.e. $|\mu|/\sigma$ is
below some threshold $\tau$). Blundell et al. ran experiments using the SNR and stated it “is in fact
related to test performance.”

Now consider our alternative method. Recall that the terms in the denominator of Equation (14) are
$\mathbb{E}^2[v] + \mathrm{Var}[v]$. Our analysis shows that MN deems weights with large means and
large variances as being high quality, turning off the sparsity penalty applied to them. This
conclusion conflicts with the SNR since using $|\mu|/\sigma$ prunes weights with large variances
first. Thus we propose the following competing heuristic we call signal-plus-robustness (SPR):

$$|\mu_{l,j,k}| + \sigma_{l,j,k} < \tau \tag{17}$$

where the terms are defined the same as above.

[Figure 2 panels: (a) MNIST, 500-300 Hidden Units; (b) IMDB, 1000 Hidden Units; (c) MSD, 120 Hidden Units; (d) Posterior Weight Moments (axis label: Squared Mean)]

Figure 2: Experimental results: weight pruning task (a,b,c) and empirical moments (d).
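The disagreement between the two rules is easy to see on a toy example. The sketch below scores weights under SNR ($|\mu|/\sigma$) and SPR ($|\mu| + \sigma$) and marks those falling below a threshold; the function names and the example values in the usage note are made up for illustration.

```python
import numpy as np


def snr_score(mu, sigma):
    """Signal-to-noise ratio: |mu| / sigma."""
    return np.abs(mu) / sigma


def spr_score(mu, sigma):
    """Signal-plus-robustness: |mu| + sigma."""
    return np.abs(mu) + sigma


def prune_mask(score, tau):
    """True where the weight falls below the threshold and would be set to zero."""
    return score < tau
```

For posterior means `[0.5, 0.05, 1.0]` and standard deviations `[2.0, 0.01, 0.5]` with threshold 1, SNR prunes only the first weight (moderate mean, large variance), whereas SPR keeps it and prunes only the second (small mean, small variance).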

We experimentally compared both pruning rules on three datasets, each with very different
characteristics. The first is the well-known MNIST dataset ($d = 784$, $N = 50k/10k$), the second is the
large IMDB movie review dataset for sentiment classification [14] ($d = 5000$, $N = 25k/25k$), and
the third is a prediction (regression) task using features preprocessed from the Million Song Dataset
(MSD) [13] ($d = 90$, $N = 460k/50k$). We trained the networks with Bernoulli MN and, when
convergence was reached, switched to Langevin Dynamics (with no MN) to collect 10,000 samples
from the posterior weight distribution of each network [25] ($\epsilon \sim \mathcal{N}(0, lr/2)$, where
$lr$ is the learning rate). A polynomial decay schedule was set by validation set performance.

We ordered the weights of each network by SNR and SPR and then removed weights (i.e. set
them to zero) in increasing order according to the two rules. Plots showing test error (number of
errors, error rate, mean RMSE) vs. percentage of weights removed can be seen in panels (a), (b),
and (c) of Figure 2. For another source of comparison, we also show the performance of a network
(completely) retrained on the soft-targets produced by the full network.³ To make the comparison
fair, the retrained networks had the same depth as the one on which pruning was done, splitting the
parameters equally between the layers.
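The removal procedure described above (rank weights by a score, then zero them in increasing order) can be sketched as follows. The function name is hypothetical and the evaluation of test error after each pruning level is omitted; the scores may come from either pruning rule.

```python
import numpy as np


def prune_lowest(weights, scores, fraction):
    """Zero out the given fraction of weights with the smallest scores.

    weights, scores: arrays of the same shape; fraction: value in [0, 1].
    Returns a pruned copy and leaves the input untouched.
    """
    n_prune = int(round(fraction * weights.size))
    order = np.argsort(scores, axis=None)  # indices into the flattened array
    pruned = weights.copy().ravel()
    pruned[order[:n_prune]] = 0.0
    return pruned.reshape(weights.shape)
```

Sweeping `fraction` from 0 to 1 and recording test error at each level would produce curves of the kind reported in panels (a) through (c) of Figure 2.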

We see that our rule, SPR ($|\mu| + \sigma$), is clearly superior to SNR ($|\mu|/\sigma$). We were able
to remove at least 20% more of the weights in each case before seeing a catastrophic increase in test
error. The most drastic difference is seen for the IMDB dataset in (b), which we believe is due to the
sparsity of the features (word counts), exaggerating SNR’s preference for overdetermined weights.
Our method, SPR, even outperformed retraining with soft-targets until at least a 50% reduction in
parameters was reached. Finally, as further empirical support for our findings, a scatter plot showing
the first two moments of each weight for two networks, one trained with Bernoulli MN and the
other without MN, can be seen in panel (d) of Figure 2. We produce the figure to show that although
our closed-form penalty technically doesn’t hold for discrete noise distributions (due to the need
to compute the gradient), the analysis (sparsity vs scale robustness) most likely extends to discrete
mixtures.

³No soft-target results are shown for (c), the MSD year prediction task, as we found training with
soft-targets does not have the same benefits for regression that it does for classification.


7 Conclusions 

This paper improves our understanding of how multiplicative noise regularizes the weights of deep
neural networks. We show that multiplicative noise can be interpreted as a Gaussian scale mixture
(under mild assumptions). This perspective not only holds for neural networks regardless of their
depth or activation function but allows us to isolate, in closed-form, the weight properties
encouraged by multiplicative noise. From this penalty we see that under multiplicative noise, the network’s
weights become either sparse or invariant to rescaling. We demonstrated the utility of our findings
by showing that a new weight pruning rule, naturally derived from our analysis, is significantly more
effective than the previously proposed signal-to-noise ratio and is even competitive with retraining with
soft-targets.